Hands-On Species Distribution Modeling in R

Authors

Federico Maioli (fedma@aqua.dtu.dk)

Brian MacKenzie (brm@aqua.dtu.dk)

Published

October 31, 2024

Species Distribution Models (SDMs) correlate species presence or abundance with environmental factors to define species niches and predict their distribution. In this hands-on tutorial, you’ll fit your first SDM using a real-world dataset from FISHGLOB—a comprehensive compilation of fishery-independent trawl surveys spanning the continental shelves of Europe and North America.

1 Data overview

To fit SDMs you generally need two key ingredients: observational data (which species are found where) and environmental layers.

1.1 Data Sources

Fish data

If you want to know more about the fish data, see the data paper here https://www.nature.com/articles/s41597-023-02866-w

Environmental layers

If you want to know more about the environmental data, see the data layers here https://bio-oracle.org/downloads-to-email.php

1.2 Load and explore the data

NOTE: Throughout the tutorial the R code is hidden, but if you want to see code, click the “Show the Code” button.

Show the code
# load the libraries
library(tidyverse)
library(glmmTMB)
library(rnaturalearth) #install.packages('rnaturalearth')
library(rnaturalearthdata)
library(sf)
library(ggeffects)
library(equatiomatic)
library(patchwork)

# load the data
data = read_csv('../data/model_input/data.csv') 

Data stucture

Show the code
str(data)
spc_tbl_ [93,468 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ haul_id    : chr [1:93468] "NS-IBTS 2010 1 DE 06NI GOV 107 5" "NS-IBTS 2010 1 DE 06NI GOV 107 5" "NS-IBTS 2010 1 DE 06NI GOV 107 5" "NS-IBTS 2010 1 DE 06NI GOV 107 5" ...
 $ year       : num [1:93468] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ latitude   : num [1:93468] 55.8 55.8 55.8 55.8 55.8 ...
 $ longitude  : num [1:93468] 4.8 4.8 4.8 4.8 4.8 ...
 $ species    : chr [1:93468] "Callionymus_lyra" "Eutrigla_gurnardus" "Gadus_morhua" "Hippoglossoides_platessoides" ...
 $ kg_km2     : num [1:93468] 0 8.359 0.496 0 468.888 ...
 $ present    : num [1:93468] 0 1 1 0 1 0 1 0 1 1 ...
 $ depth      : num [1:93468] 37.7 37.7 37.7 37.7 37.7 ...
 $ bottom_temp: num [1:93468] 9.43 9.43 9.43 9.43 9.43 ...
 - attr(*, "spec")=
  .. cols(
  ..   haul_id = col_character(),
  ..   year = col_double(),
  ..   latitude = col_double(),
  ..   longitude = col_double(),
  ..   species = col_character(),
  ..   kg_km2 = col_double(),
  ..   present = col_double(),
  ..   depth = col_double(),
  ..   bottom_temp = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

What species are present?

Show the code
(species = unique(data$species))
 [1] "Callionymus_lyra"             "Eutrigla_gurnardus"          
 [3] "Gadus_morhua"                 "Hippoglossoides_platessoides"
 [5] "Limanda_limanda"              "Melanogrammus_aeglefinus"    
 [7] "Merlangius_merlangus"         "Merluccius_merluccius"       
 [9] "Microstomus_kitt"             "Pleuronectes_platessa"       
[11] "Trisopterus_esmarkii"         "Trisopterus_minutus"         

Check them out on Worms or FishBase.

Calculate their frequency of occurrence (%)

Show the code
total_hauls <- n_distinct(data$haul_id) # Calculate the total number of hauls

(data %>%
  filter(present == 1) %>%                # Only include present species
  count(species) %>%                       # Count occurrences per species
  mutate(percentage = n / total_hauls * 100) %>%  # Calculate percentage
  arrange(desc(percentage)))                 # Sort by percentage
# A tibble: 12 × 3
   species                          n percentage
   <chr>                        <int>      <dbl>
 1 Merlangius_merlangus          7438       95.5
 2 Limanda_limanda               6850       87.9
 3 Pleuronectes_platessa         6144       78.9
 4 Eutrigla_gurnardus            5915       75.9
 5 Gadus_morhua                  5350       68.7
 6 Hippoglossoides_platessoides  5334       68.5
 7 Microstomus_kitt              5322       68.3
 8 Melanogrammus_aeglefinus      4980       63.9
 9 Trisopterus_esmarkii          3957       50.8
10 Merluccius_merluccius         2867       36.8
11 Callionymus_lyra              2733       35.1
12 Trisopterus_minutus           2437       31.3

Where do we have observations?

Show the code
# extract the coastline

map_data <- rnaturalearth::ne_countries(scale = "large",
                                        returnclass = "sf",
                                        continent = "europe")


north_sea <- suppressWarnings(suppressMessages(st_crop(
  map_data, c(
    xmin = -7,
    ymin = 49,
    xmax = 13,
    ymax = 63
  )
)))

# plot the data points

ggplot(data = data, aes(longitude, latitude)) + geom_point(color =
                                                                     'black', alpha = .3) + 
  geom_sf(data = north_sea, inherit.aes = F) + theme_light() + xlim(-6, 12) + ylim(50, 62) +
  ylab('Latitude') + xlab('Longitude')

How many observations (trawls) per year?

Show the code
data |> distinct(haul_id, year)|>   group_by(year) |> summarise(number_of_trawls=n()) |> ggplot(aes(year,number_of_trawls))+geom_line()+scale_x_continuous(breaks= scales::pretty_breaks())

2 Example - modelling Atlantic cod (Gadus morhua) distribution

Warning

In this exercise, we’re skipping model assumption checks—a bad practice! But with limited time, I’d rather spark your curiosity about these models and focus only on interpretation.

2.1 Map the raw data

Show the code
# add log denisty

data$log_density = log(data$kg_km2+1)

ggplot(data = data |> filter(species=='Gadus_morhua'), aes(longitude, latitude, color = as.factor(present))) + 
  geom_point(size=.5) + 
  geom_sf(data = north_sea, inherit.aes = FALSE) + 
  theme_light() + 
  xlim(-6, 12) + 
  ylim(50, 62) +
  ylab('Latitude') + 
  xlab('Longitude') + 
  scale_color_manual(
    name = "Atlantic cod",
    values = c("0" = "red", "1" = "blue"),
    labels = c("0" = "Absent", "1" = "Present")
  )

Show the code
ggplot(data = data |> filter(species=='Gadus_morhua'), aes(longitude, latitude, color = log_density)) + 
  geom_point(size=.5) + 
  geom_sf(data = north_sea, inherit.aes = FALSE) + 
  theme_light() + 
  xlim(-6, 12) + 
  ylim(50, 62) +
  ylab('Latitude') + 
  xlab('Longitude') + scale_color_viridis_c('Atlantic cod log(density)')

2.2 Fit a simple linear model with temperature

Show the code
my_SDM = lm(log_density ~ 1 + bottom_temp,
                 data = data |> filter(species=='Gadus_morhua'))

extract_eq(my_SDM, intercept = "beta", show_distribution = TRUE) # More mathy

\[ \begin{aligned} \operatorname{log\_density} &= \beta_{0} + \beta_{1}(\operatorname{bottom\_temp}) + \epsilon \end{aligned} \]

Show the code
summary(my_SDM) # Let's inspect the params estimates

Call:
lm(formula = log_density ~ 1 + bottom_temp, data = filter(data, 
    species == "Gadus_morhua"))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7657 -2.0719 -0.1714  1.9502  8.2234 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.26521    0.17487   41.55   <2e-16 ***
bottom_temp -0.50767    0.01941  -26.16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.295 on 7787 degrees of freedom
Multiple R-squared:  0.08076,   Adjusted R-squared:  0.08064 
F-statistic: 684.1 on 1 and 7787 DF,  p-value: < 2.2e-16

Can you interpret the intercept?

How do we interpret the slope? With a increase of 1 degree, log density increases/decreases by …

Show the code
partial_effect = predict_response(my_SDM, terms = "bottom_temp")

plot(partial_effect, ci_style = "dash")

Group discussion (10 minutes)

Do you think this approach is reasonable, or are we missing something? Can you think of any alternatives to a linear model for describing a species’ relationship with temperature?

2.2.1 Predict, predict, predict!

Load and inspect the prediction grid:

Show the code
grid = read_csv('../data/model_input/grid.csv') # load the grid

str(grid)
spc_tbl_ [33,596 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ longitude              : num [1:33596] -1.1 -0.85 -0.8 -0.75 -0.7 -0.65 -0.6 -0.55 -0.5 -0.45 ...
 $ latitude               : num [1:33596] 49.4 49.4 49.4 49.4 49.4 49.4 49.4 49.4 49.4 49.4 ...
 $ depth                  : num [1:33596] 28.14 0 1.31 9 12.83 ...
 $ bottom_temp_now        : num [1:33596] 12.7 13.2 13.2 13 12.9 ...
 $ bottom_temp_2100_ssp119: num [1:33596] 12.7 13.3 13.3 13.1 13 ...
 $ bottom_temp_2100_ssp585: num [1:33596] 15.2 16.1 16.1 15.9 15.7 ...
 - attr(*, "spec")=
  .. cols(
  ..   longitude = col_double(),
  ..   latitude = col_double(),
  ..   depth = col_double(),
  ..   bottom_temp_now = col_double(),
  ..   bottom_temp_2100_ssp119 = col_double(),
  ..   bottom_temp_2100_ssp585 = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Predict the ‘now’ scenario

Show the code
grid$log_density = predict(my_SDM, grid |> mutate(bottom_temp = bottom_temp_now))

p1 = ggplot(data = grid, aes(longitude, latitude, fill = log_density)) + geom_raster() +
  scale_fill_viridis_c("Atlantic cod log(biomass)"
  ) + geom_sf(data = north_sea, inherit.aes = F) + theme_light() + xlim(-6, 12) +
  ylim(50, 62)+ggtitle('Now')
p1

Predict the future (2100) scenario SSP5-8.5

Show the code
grid$log_density = predict(my_SDM, grid |> mutate(bottom_temp = bottom_temp_2100_ssp585))

p2 = ggplot(data = grid, aes(longitude, latitude, fill = log_density)) + geom_raster() +
  scale_fill_viridis_c("Atlantic cod log(biomass)"
  ) + geom_sf(data = north_sea, inherit.aes = F) + theme_light() + xlim(-6, 12) +
  ylim(50, 62)+ggtitle('2100 SSP5-8.5')
p2

Compare them:

Show the code
p1 + p2 + plot_layout(guides = 'collect') & scale_fill_viridis_c(limits = range(c(grid$log_density)))  & theme(legend.position="bottom", legend.box = "horizontal")

Group discussion (30 minutes)

Is this what you expected?

Just eyeballing it, based on this model, will the (log)density change a lot under the future scenario, or is it comparable?

Can you think of one simple calculation for comparing the current vs future density?

In this example, we only considered bottom temperature as a driver of species distribution. Can you think of a couple more environmental variables that can influence fish densities?

2.3 A slightly more complex model

What if we want to predict Atlantic cod presence/absence? Well, in this case we need to leave the comfort of linear models and embrace Generalized Linear Models (GLMs).

2.3.2 In practice

Show the code
my_lr_SDM <- glm(present ~ 1 + bottom_temp,
                 data = data |> filter(species=='Gadus_morhua'),family = binomial(link = "logit")
)

extract_eq(my_lr_SDM, wrap = TRUE,intercept = "beta", show_distribution = TRUE)

\[ \begin{aligned} \operatorname{present} &\sim Bernoulli\left(\operatorname{prob}_{\operatorname{present} = \operatorname{1}}= \hat{P}\right) \\ \log\left[ \frac { \hat{P} }{ 1 - \hat{P} } \right] &= \beta_{0} + \beta_{1}(\operatorname{bottom\_temp}) \end{aligned} \]

Show the code
partial_effect <- predict_response(my_lr_SDM, terms = "bottom_temp") # this function is smart, it's already on the response scale

plot(partial_effect, ci_style = "dash")+ggtitle('')+ylab('Predicted probability of occurence')

Wait it’s not linear? What is the interpretation of the slope then?

Show the code
summary(my_lr_SDM)

Call:
glm(formula = present ~ 1 + bottom_temp, family = binomial(link = "logit"), 
    data = filter(data, species == "Gadus_morhua"))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.54108    0.18151   30.53   <2e-16 ***
bottom_temp -0.52556    0.01964  -26.75   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9683.1  on 7788  degrees of freedom
Residual deviance: 8886.5  on 7787  degrees of freedom
AIC: 8890.5

Number of Fisher Scoring iterations: 4

The slope is -0.526.

Well, the relationship is not linear on the probability scale (i.e. response). The logit-transformed true probabilities follow the estimated intercept and slope.

A unit increase in bottom_temp corresponds to a -0.526 increase in the log odds of a 1 (species present) being observed.

I took this from https://github.com/seananderson/glmm-course/blob/master/12-glms.Rmd

logit <- function(x) qlogis(x)
inverse_logit <- function(x) plogis(x)

x<- seq(-10, 10, length.out = 100)

plot(x, inverse_logit(x), type = "l")

plot(x, logit(inverse_logit(x)), type = "l")

If we exponentiate the slope coefficient we get the expected fold increase (decrease) in the odds of observing a 1 (the species being present): 0.59 per unit increase in bottom_temp.

See also here.

A quick trick is to take the slope of the logistic regression and divide it by 4. This will give you approximately the expected change in probability per unit change in the variable at the steepest part of the line: -0.13.

3 Now it’s you turn!

Open the script 02_fit_my_model_template.R under ‘exercise’ and fit your model(s).

4 Suggested lectures

4.1 SDMs (with sdmTMB package)

4.1.1 Introductory

Show the code
pacman::p_load("vembedr")

embed_url("https://www.youtube.com/watch?v=DIXa7ngVVL0") %>%
  use_bs_responsive()

4.1.2 Advanced topics

Show the code
embed_url("https://www.youtube.com/watch?v=VxnqgiAAjfk") %>%
  use_bs_responsive()

4.2 Stats

“Statistical Rethinking” is one of the nicest book out there for stats. Also lectures are available online. It’s long (20 lectures 1h 15min each) but absolutely worth every bit of effort.

Show the code
embed_url("https://www.youtube.com/watch?v=FdnMWdICdRs&list=PLDcUM9US4XdPz-KxHM4XHt7uUVGWWVSus&index=1") %>%
  use_bs_responsive()

5 Suggested books

5.1 SDMs

  • Ovaskainen O., & Abrego N. (2020). Joint Species Distribution Modelling: With Applications in R. Cambridge: Cambridge University Press.

  • Thorson, J., & Kristensen, K. (2024). Spatio-Temporal Models for Ecologists (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781003410294

5.2 Stats

  • Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge: Cambridge University Press.

  • McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and STAN (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9780429029608